1 Libraries

# session info
# -- R version 3.6.3 (2020-02-29)

# libraries
library(dplyr) # Version ‘0.8.5’
library(ggplot2) # Version '3.3.0'
# library(stringr) # Version '1.4.0'
library(gam) # Version '1.16.1'
library(MASS) # Version '7.3-51.5'
library(DHARMa) # Version '0.3.1'
# DHARMa vignette: https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html
library(AER) # Version '1.2-9'
library(statR) # Version ‘0.0.0.9000’

2 Data import

# Data set for 2019
miv_2019 <- readr::read_csv("./miv.csv")

# Data set for 2020
miv_2020 <- readr::read_csv("./Mobility_VerkehrsmessstellenKantonZH.csv")

3 Data preparation

3.1 Daily values for 2019

head(miv_train_day)

3.2 Daily values for 2020

head(miv_test)

4 Plot measured data

4.1 Daily values of flow for 2019

4.2 Daily values of flow for 2020

5 Graphical Analysis

5.1 Density plot – Check if the response variable is close to normal

Skewness = -061 -> Negative- or left-skewed distribution Flow data is negatively skewed: We see a large share of relatively high values, but also abou t one third of time intervals with smaller numbers of vehicles per day. Surprisingly no zeros.

With count data that follows a Poisson distribution, a square root transformation is recommended. In this case, the square root transformation does not result in an improvement on teh distribution. For the model fitting, we will go with the original floa data.

5.2 BoxPlot – Check for outliers

The square root transformation does not distribute the flow values more evenly. The original data is more evenly distributed - outliers primarily in the lower end of the value range.

5.3 Correlation

Correlation is a statistical measure that suggests the level of linear dependence between two variables, that occur in pair. It can take values between -1 and +1. A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) suggests that much of variation of the response variable is unexplained by the predictor, in which case, we should probably look for a better predictor.

According to correlation, most variation of the flow_daysum variable is explained by the predictor weekyday_num. But we see that there is at least some sort of relationship with all explanatory variables.

6 Build Linear Model

As the correlation matrix suggests all explenatory variables do have an influence on the response variable, the linear model will include all variables.

linearMod <- lm(flow_daysum ~ month +  weekday + holiday + schoolholiday, data=miv_train_day)

6.1 Results

  • p Value: The model p-Value and almost all the p-Values of individual predictor variables (besides some weekdays) are smaller then 0.05. We can conclude that our model is indeed statistically significant. There exists a relationship between the independent variable in question and most of the dependent variables.

  • Adj. R-squared: 0.8872 -> ~88% of variation in the response variable has been explained by this model.

  • Accuracy: Besides looking at the adj. R-squared for efficiency of the model, it’s a better practice to look at the AIC and prediction accuracy on validation sample for an accuracy measure.

  • Goodness-of-fit (AIC): AIC(linearMod) => 3005.332. Rule of thumb: the lower the better.

## 
## Call:
## lm(formula = flow_daysum ~ month + weekday + holiday + schoolholiday, 
##     data = miv_train_day)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3709.7  -411.8    -4.6   510.4  2590.4 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      14374.32     245.11  58.644  < 2e-16 ***
## month              561.09      42.21  13.293  < 2e-16 ***
## weekdayMonday    -1342.91     265.20  -5.064 1.06e-06 ***
## weekdaySaturday  -3873.14     262.59 -14.750  < 2e-16 ***
## weekdaySunday    -5808.48     262.39 -22.136  < 2e-16 ***
## weekdayThursday    -22.26     262.00  -0.085    0.932    
## weekdayTuesday    -247.26     262.02  -0.944    0.347    
## weekdayWednesday    23.16     262.41   0.088    0.930    
## holiday          -5861.86     377.31 -15.536  < 2e-16 ***
## schoolholiday    -1555.41     188.86  -8.236 4.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 944.6 on 171 degrees of freedom
## Multiple R-squared:  0.8928, Adjusted R-squared:  0.8872 
## F-statistic: 158.3 on 9 and 171 DF,  p-value: < 2.2e-16

7 GLM Model

7.1 Poisson

poissonMod <- glm(flow_daysum ~ month +  weekday + holiday + schoolholiday, family="poisson", data = miv_train_day)
## 
## Call:
## glm(formula = flow_daysum ~ month + weekday + holiday + schoolholiday, 
##     family = "poisson", data = miv_train_day)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -32.058   -3.600    0.305    4.201   25.765  
## 
## Coefficients:
##                    Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       9.5652708  0.0021352 4479.704   <2e-16 ***
## month             0.0393792  0.0003743  105.214   <2e-16 ***
## weekdayMonday    -0.0891981  0.0022912  -38.930   <2e-16 ***
## weekdaySaturday  -0.2778933  0.0023688 -117.316   <2e-16 ***
## weekdaySunday    -0.4484593  0.0024827 -180.637   <2e-16 ***
## weekdayThursday  -0.0005565  0.0022066   -0.252    0.801    
## weekdayTuesday   -0.0187994  0.0022155   -8.486   <2e-16 ***
## weekdayWednesday -0.0015176  0.0022148   -0.685    0.493    
## holiday          -0.4839485  0.0040583 -119.250   <2e-16 ***
## schoolholiday    -0.1200451  0.0017691  -67.858   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 106307  on 180  degrees of freedom
## Residual deviance:  12289  on 171  degrees of freedom
## AIC: 14368
## 
## Number of Fisher Scoring iterations: 4

7.1.1 Results

  • Checking for overdispersion: residual deviance is 12289 for 171 degrees of freedom.

  • The ratio of deviance to df should be around 1, here it’s 71.86, indicating overdispersion.

7.1.2 Explanation

A common reason for overdispersion is a misspecified model, meaning the assumptions of the model are not met, hence we cannot trust its output (e.g. p-Values). When overdispersion is detected, one should therefore first search for problems in the model specification (e.g. by plotting residuals against predictors), and only if this doesn’t lead to success, overdispersion corrections such as changes in the distribution should be applied.

7.1.3 Plotting the residuals

We see various signals of overdispersion:

  • QQ: s-shaped QQ plot, distribution and dispersion test are signficant

  • Residual vs. predicted: Quantile fits are spread out too far. We get more residuals around 0 and 1, which means that more residuals are in the tail of distribution than would be expected under the fitted model.

7.1.4 Action

We need to adjust for overdispersion.

7.2 Adjusting for overdispersion

7.2.1 Quasi Poisson

The quasi-families augment the normal families by adding a dispersion parameter.

  • Dispersion parameter is estimated at 70.96 - a value similar to those in the overdispersion tests above.
  • Main effect are substantially larger standard errors, but an unchanged significance.
  • Note: because this is no maximum likelihood method no AIC available and no overdispersion tests can be conducted.
## 
## Call:
## glm(formula = flow_daysum ~ month + weekday + holiday + schoolholiday, 
##     family = "quasipoisson", data = miv_train_day)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -32.058   -3.600    0.305    4.201   25.765  
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.5652708  0.0179871 531.784  < 2e-16 ***
## month             0.0393792  0.0031529  12.490  < 2e-16 ***
## weekdayMonday    -0.0891981  0.0193010  -4.621 7.48e-06 ***
## weekdaySaturday  -0.2778933  0.0199542 -13.927  < 2e-16 ***
## weekdaySunday    -0.4484593  0.0209136 -21.443  < 2e-16 ***
## weekdayThursday  -0.0005565  0.0185878  -0.030    0.976    
## weekdayTuesday   -0.0187994  0.0186629  -1.007    0.315    
## weekdayWednesday -0.0015176  0.0186575  -0.081    0.935    
## holiday          -0.4839485  0.0341866 -14.156  < 2e-16 ***
## schoolholiday    -0.1200451  0.0149024  -8.055 1.30e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 70.96227)
## 
##     Null deviance: 106307  on 180  degrees of freedom
## Residual deviance:  12289  on 171  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

7.2.2 Step approach

## Start:  AIC=14367.92
## flow_daysum ~ month + weekday + holiday + schoolholiday
## 
## Call:  glm(formula = flow_daysum ~ month + weekday + holiday + schoolholiday, 
##     family = "poisson", data = miv_train_day)
## 
## Coefficients:
##      (Intercept)             month     weekdayMonday   weekdaySaturday  
##        9.5652708         0.0393792        -0.0891981        -0.2778933  
##    weekdaySunday   weekdayThursday    weekdayTuesday  weekdayWednesday  
##       -0.4484593        -0.0005565        -0.0187994        -0.0015176  
##          holiday     schoolholiday  
##       -0.4839485        -0.1200451  
## 
## Degrees of Freedom: 180 Total (i.e. Null);  171 Residual
## Null Deviance:       106300 
## Residual Deviance: 12290     AIC: 14370

7.2.3 Changed distribution: Negative Binomial

Negative binomial includes a dispersion parameter.

Already here we see that the ratio of deviance and df is near 1 (1.062164) and hence probably fine.

nbMod <- glm.nb(flow_daysum ~ month +  weekday + holiday + schoolholiday, data = miv_train_day)
summary(nbMod)
## 
## Call:
## glm.nb(formula = flow_daysum ~ month + weekday + holiday + schoolholiday, 
##     data = miv_train_day, init.theta = 178.1534899, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.8771  -0.4259   0.0185   0.4836   3.0326  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       9.5598179  0.0195613 488.712  < 2e-16 ***
## month             0.0409996  0.0033697  12.167  < 2e-16 ***
## weekdayMonday    -0.0865983  0.0211618  -4.092 4.27e-05 ***
## weekdaySaturday  -0.2777866  0.0209634 -13.251  < 2e-16 ***
## weekdaySunday    -0.4505642  0.0209618 -21.495  < 2e-16 ***
## weekdayThursday   0.0001086  0.0208994   0.005    0.996    
## weekdayTuesday   -0.0215609  0.0209031  -1.031    0.302    
## weekdayWednesday -0.0022003  0.0209340  -0.105    0.916    
## holiday          -0.4919153  0.0302063 -16.285  < 2e-16 ***
## schoolholiday    -0.1187551  0.0150871  -7.871 3.51e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(178.1535) family taken to be 1)
## 
##     Null deviance: 1423.96  on 180  degrees of freedom
## Residual deviance:  181.63  on 171  degrees of freedom
## AIC: 3054.1
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  178.2 
##           Std. Err.:  19.0 
## 
##  2 x log-likelihood:  -3032.052

7.2.3.1 Checking with Residual Plots

Dispersion test is still significant -> still overdispersion.

## 
##  DHARMa nonparametric dispersion test via mean deviance residual fitted
##  vs. simulated-refitted
## 
## data:  simulationNBMod
## dispersion = 0.97995, p-value = 0.016
## alternative hypothesis: two.sided

8 GAM

8.1 GAM with all predictive variables

gamMod <- mgcv::gam(flow_daysum ~ month +  weekday + holiday + schoolholiday, data = miv_train_day, method = "REML")

8.1.1 Summary

  • R-sq.(adj) = 0.887
  • Predictor significance unchanged to the linear model.
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## flow_daysum ~ month + weekday + holiday + schoolholiday
## 
## Parametric coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      14374.32     245.11  58.644  < 2e-16 ***
## month              561.09      42.21  13.293  < 2e-16 ***
## weekdayMonday    -1342.91     265.20  -5.064 1.06e-06 ***
## weekdaySaturday  -3873.14     262.59 -14.750  < 2e-16 ***
## weekdaySunday    -5808.48     262.39 -22.136  < 2e-16 ***
## weekdayThursday    -22.26     262.00  -0.085    0.932    
## weekdayTuesday    -247.26     262.02  -0.944    0.347    
## weekdayWednesday    23.16     262.41   0.088    0.930    
## holiday          -5861.86     377.31 -15.536  < 2e-16 ***
## schoolholiday    -1555.41     188.86  -8.236 4.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## 
## R-sq.(adj) =  0.887   Deviance explained = 89.3%
## -REML = 1431.2  Scale est. = 8.9221e+05  n = 181

8.1.2 Checking the residual plots

  • The longtail effect in the residuals is clearly visible. Otherwise the residuals are well distributed.
  • The residual plots have a some rise-and-fall pattern – clearly there is some dependence structure (it has something to do with intra-annual, inter-daily and inter-weekly fluctuations).
  • Let’s introduce a cyclical smoother for time variables.

## 
## Method: REML   Optimizer: outer newton
## full convergence after 3 iterations.
## Gradient range [-3.581009e-08,-3.581009e-08]
## (score 1431.182 & scale 892207.3).
## Hessian positive definite, eigenvalue range [85.5,85.5].
## Model rank =  10 / 10

8.2 Introducing a cyclical smooth term for month

gamMod_cs <- mgcv::gam(flow_daysum ~ s(month, bs = 'cc', k = 6) + weekday + holiday + schoolholiday, data = miv_train_day, method = "REML")

8.2.1 Summary

  • Significance hasn’t changed compared to teh linear model and GAM.
  • R-sq.(adj) = 0.778
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## flow_daysum ~ s(month, bs = "cc", k = 6) + weekday + holiday + 
##     schoolholiday
## 
## Parametric coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      16444.07     265.05  62.042  < 2e-16 ***
## weekdayMonday    -1379.52     372.46  -3.704 0.000287 ***
## weekdaySaturday  -3811.27     368.74 -10.336  < 2e-16 ***
## weekdaySunday    -5785.15     368.51 -15.699  < 2e-16 ***
## weekdayThursday    -68.40     367.98  -0.186 0.852755    
## weekdayTuesday    -310.43     367.94  -0.844 0.400024    
## weekdayWednesday   -32.26     368.51  -0.088 0.930354    
## holiday          -5620.81     533.90 -10.528  < 2e-16 ***
## schoolholiday    -2056.42     265.12  -7.757 7.62e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##            edf Ref.df     F p-value  
## s(month) 1.632      4 1.082  0.0566 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.778   Deviance explained = 78.9%
## -REML =   1496  Scale est. = 1.7593e+06  n = 181

8.2.2 Let’s look at the fitted smooth term:

  • Looking at the smooth term, we can see that the monthly smoother is picking up that monthly rise and fall of flow.

  • Looking at the relative magnitudes (i.e. monthly fluctuation vs. long-term trend), we can see how important it is to disintangle the components of the time series.

8.2.3 Let’s see what the model diagnostics look now:

## 
## Method: REML   Optimizer: outer newton
## full convergence after 4 iterations.
## Gradient range [-7.065438e-06,-2.041466e-07]
## (score 1496.047 & scale 1759326).
## Hessian positive definite, eigenvalue range [0.4583535,86.00779].
## Model rank =  13 / 13 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##            k'  edf k-index p-value    
## s(month) 4.00 1.63    0.27  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.3 Cyclical smooth terms for all time components

gamMod_3 <- mgcv::gam(flow_daysum ~ s(month, bs = 'cc', k = 6) + s(weekday_num, bs = 'cc', k = 7) + holiday + schoolholiday, data = miv_train_day)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## flow_daysum ~ s(month, bs = "cc", k = 6) + s(weekday_num, bs = "cc", 
##     k = 7) + holiday + schoolholiday
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    14803.9      148.0 100.016  < 2e-16 ***
## holiday        -4851.9      709.6  -6.838 1.33e-10 ***
## schoolholiday  -2150.6      351.0  -6.126 5.91e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df      F p-value    
## s(month)       0.9206      4  0.352    0.21    
## s(weekday_num) 4.5307      5 34.569  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.598   Deviance explained = 61.5%
## GCV = 3.3362e+06  Scale est. = 3.1804e+06  n = 181

8.3.1 Let’s look at the fitted smooth terms:

The cyclical smoother for weekday is picking up the rise and fall of flow over the week.

8.3.2 Let’s see what the model diagnostics look now:

## 
## Method: GCV   Optimizer: magic
## Smoothing parameter selection converged after 9 iterations.
## The RMS GCV score gradient at convergence was 25.12427 .
## The Hessian was positive definite.
## Model rank =  12 / 12 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                   k'   edf k-index p-value    
## s(month)       4.000 0.921    0.79   0.005 ** 
## s(weekday_num) 5.000 4.531    0.44  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9 AIC - Goodness-of-fit

Compare the actual fit of all models:

  • The linear model and the gamMod are showing the best fit.

10 Prediction

10.1 Prediction on the existing data set miv_train_day

miv_pred_0 <- data.frame(time = miv_train_day$date,
                       flow = miv_train_day$flow_daysum,
                       predicted_values_0 = predict(linearMod, newdata = miv_train_day))

10.2 Prediction on the test data set miv_test

10.2.1 For the linear Model

miv_pred <- data.frame(time = miv_test$date,
                       flow = miv_test$flow_daysum,
                       predicted_values = predict(linearMod, newdata = miv_test, interval = "confidence"))

10.2.2 For the GAM

miv_pred_2 <- data.frame(time = miv_test$date,
                       flow = miv_test$flow_daysum,
                       predicted_values_2 = predict(gamMod, newdata = miv_test))